Abstract. This paper describes our participation in the TREC 2014
Microblog real-time search task. We investigate whether page views from
Wikipedia can be used successfully to estimate relevant time periods for
queries. To this end, we use a recently published temporal reranking
method by Efron et al. [2], which uses kernel density estimation.


1  Introduction  and  Task  Description 

In the Temporally-Anchored Ad Hoc Retrieval task, the user wishes to search for
the most recent and relevant posts. The task can be summarized as: at time t,
find tweets about topic X. Therefore, systems should favor relevant and highly
informative tweets about the query topic posted before the query time. Due
to the nature of microblogs, it is likely that relevance has a temporal
dimension. That is, relevant tweets are likely to have been published recently,
close to the time of the query. Therefore, systems should also take into account
the time at which the tweets were posted.

Participants  can  access  the  Tweets2013  corpus  by  issuing  text  queries  to  a 
search  API  provided  by  the  track.  Therefore  we  experimented  with  methods  to 
temporally  rerank  the  list  of  tweets  returned  using  the  search  API. 


Tweets2013 corpus. This collection consists of approximately 240 million
tweets (statuses), collected via the Twitter streaming API by crawling the
public stream sample over a two-month period: 1 February 2013 to 31 March 2013
(inclusive). NIST created 60 topics based on this corpus, each representing an
information need at a specific point in time. The assessors judged the relevance
of each tweet and also considered the relevance of any URLs linked from it. All
assessments were conducted by NIST assessors on a three-point scale of
“informativeness”: not relevant, relevant and highly relevant. The primary
evaluation measure is MAP; precision at rank 30 (P30) and R-prec are also reported.
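
The three measures can be sketched for a single topic as follows. This is only an illustration under simplifying assumptions: relevance is treated as binary (relevant and highly relevant are merged), the function names are ours, and the official figures come from the track's evaluation tooling rather than from this code. MAP is the mean of the per-topic average precision values.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank  # precision at each rank where a relevant doc occurs
    return total / len(relevant_ids) if relevant_ids else 0.0


def precision_at(ranked_ids, relevant_ids, k=30):
    """Fraction of the top-k results that are relevant (P30 when k=30)."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k


def r_precision(ranked_ids, relevant_ids):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant_ids)
    return precision_at(ranked_ids, relevant_ids, k=r) if r else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```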
2  Approach


We follow the approach proposed by Efron et al. [2]. They separate the lexical
and temporal signals into two components, following the view of Dakka et al. [1].
To combine these two components, they propose the following log-linear model:


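\log P_\alpha(R \mid D, Q) = (1 - \alpha) \log P(R \mid W_D, Q)    (1)

                           + \alpha \log P(R \mid T_D, Q)           (2)

where $W_D$ and $T_D$ denote the lexical content and the timestamp of document
$D$, and $\alpha$ can be tuned. In our experiments, $\alpha$ was fixed to 0.5.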
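
As an illustration, the log-linear combination can be sketched as follows. This is a minimal sketch, not the implementation used for our runs: the function names and the tuple layout are hypothetical, both inputs are assumed to be probability-like positive scores rather than log scores, and the small floor `eps` is our addition to avoid taking the logarithm of zero.

```python
import math


def combine_log_linear(p_lexical, p_temporal, alpha=0.5, eps=1e-12):
    """Return (1 - alpha) * log P(R|W_D,Q) + alpha * log P(R|T_D,Q).

    eps is an assumed floor that keeps the logarithm defined for zero inputs.
    """
    return ((1.0 - alpha) * math.log(max(p_lexical, eps))
            + alpha * math.log(max(p_temporal, eps)))


def rerank(candidates, alpha=0.5):
    """candidates: list of (doc_id, p_lexical, p_temporal) tuples.

    Returns (doc_id, combined_score) pairs sorted by decreasing score.
    """
    scored = [(doc_id, combine_log_linear(pl, pt, alpha))
              for doc_id, pl, pt in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With alpha = 0 the ranking reduces to the lexical score alone, and with alpha = 1 to the temporal score alone.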

Formally, they use a standard query-likelihood estimate for $P(R \mid W_D, Q)$. The
probability of relevance given a timestamp and the query, $P(R \mid T_D, Q)$, is
viewed as the distribution over time of documents relevant to a query $Q$, and
thus a density $f_Q$ exists which can be estimated. Our approach uses
non-parametric kernel density estimation over the timestamps of a ranked list
obtained using a standard retrieval method on the corpus [2], as well as over an
external signal: the page views of a Wikipedia page associated with each query.
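
A weighted Gaussian kernel density estimate over timestamps can be sketched as below. The sketch makes several assumptions that are not specified in this paper: the Gaussian kernel, the normal-reference bandwidth rule used as a default, and the use of weights. For retrieved tweets, each sample would be a tweet timestamp; for Wikipedia, one plausible encoding is one sample per day weighted by that day's page-view count.

```python
import math


def gaussian_kde(samples, weights=None, bandwidth=None):
    """Return a function t -> estimated density, built from weighted samples."""
    n = len(samples)
    if n == 0:
        raise ValueError("at least one sample is required")
    if weights is None:
        weights = [1.0] * n
    total = sum(weights)
    weights = [w / total for w in weights]  # normalise so the density integrates to 1
    if bandwidth is None:
        # Normal-reference rule of thumb (an assumed default, not from the paper).
        mean = sum(w * x for w, x in zip(weights, samples))
        var = sum(w * (x - mean) ** 2 for w, x in zip(weights, samples))
        std = math.sqrt(var) or 1.0  # fall back to 1.0 if all samples coincide
        bandwidth = 1.06 * std * n ** (-1.0 / 5.0)
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))

    def density(t):
        return norm * sum(w * math.exp(-0.5 * ((t - x) / bandwidth) ** 2)
                          for w, x in zip(weights, samples))

    return density


# Example: timestamps in days; three tweets cluster around day 10.
f = gaussian_kde([10.0, 10.5, 11.0, 30.0])
peak_is_higher = f(10.5) > f(20.0)
```

The value of the estimated density at a document's timestamp can then play the role of the temporal component in the log-linear model.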

We  also  employ  the  RM3  [3]  method  for  pseudo-relevance  feedback  after 
reranking  using  a  temporal  model,  since  documents  from  peak  time  periods  can 
contain  more  informative  terms. 

3  Baselines  and  Official  Runs 

QL ranks by the scores as retrieved from the track’s Tweets2013 search API.
Additionally, the system tries to filter out two classes of tweets that are not
relevant as per the track guidelines. First, Twitter-style retweets as well as
RT-style retweets are filtered out. Second, tweets not written in English are
filtered out, using the ldig project (see footnote 1) for language detection.
This baseline serves as the basis for the other runs below, and therefore all
runs filter out retweets and tweets not written in English.
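
The two filters can be sketched as follows. This is a hedged illustration only: the tweet is assumed to be a dictionary shaped like Twitter status JSON (with `text` and, for native retweets, `retweeted_status`), the regular expression for RT-style retweets is our guess at a reasonable pattern, and `detect_language` is a hypothetical callable standing in for the language detector, whose actual interface is not reproduced here.

```python
import re

# "RT @user" at the start of the text or after whitespace (assumed pattern).
RT_PATTERN = re.compile(r"(^|\s)RT\s*@\w+", re.IGNORECASE)


def is_retweet(tweet):
    """True for native (Twitter-style) retweets and for RT-style retweets."""
    if tweet.get("retweeted_status") is not None:
        return True
    return bool(RT_PATTERN.search(tweet.get("text", "")))


def keep_tweet(tweet, detect_language):
    """Keep only tweets that are not retweets and are detected as English.

    detect_language: hypothetical callable mapping text to a language code.
    """
    if is_retweet(tweet):
        return False
    return detect_language(tweet.get("text", "")) == "en"
```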


RM3 uses the RM3 [3] method for pseudo-relevance feedback without applying
any temporal reranking. Original terms and new terms were given the same
weight. The number of feedback documents was set to 50 and the number of
feedback terms to 20. Tweet replies are filtered out from the document set used
for feedback.
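
For illustration, query expansion in the style of RM3 can be sketched as below: a relevance model is estimated from the feedback documents, truncated to the top terms, and interpolated with the original query model. The sketch is simplified and rests on assumptions of ours: documents are pre-tokenized, document models are unsmoothed maximum-likelihood estimates, and each feedback document carries a positive, probability-like score standing in for P(Q|D). The function and parameter names are hypothetical.

```python
from collections import Counter


def rm3_query_model(query_terms, feedback_docs, n_terms=20, orig_weight=0.5):
    """Return a dict term -> weight for the expanded query.

    feedback_docs: list of (tokens, score) pairs; score stands in for P(Q|D).
    """
    # Relevance model: P(w|R) proportional to sum over docs of P(w|D) * P(Q|D).
    score_sum = sum(score for _, score in feedback_docs) or 1.0
    rm1 = Counter()
    for tokens, score in feedback_docs:
        if not tokens:
            continue
        for term, tf in Counter(tokens).items():
            rm1[term] += (tf / len(tokens)) * (score / score_sum)

    # Keep the n_terms strongest expansion terms and renormalise them.
    top = dict(rm1.most_common(n_terms))
    top_sum = sum(top.values()) or 1.0

    # Interpolate the original query model with the truncated relevance model.
    model = Counter()
    for term, tf in Counter(query_terms).items():
        model[term] += orig_weight * tf / len(query_terms)
    for term, p in top.items():
        model[term] += (1.0 - orig_weight) * p / top_sum
    return dict(model)
```

With `orig_weight=0.5` the original and expansion terms receive the same total weight, and `n_terms=20` matches the number of feedback terms stated above.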


NovaSearch0 estimates the density of the distribution of relevant documents
using KDE over the timestamps of retrieved documents and uses it for reranking.


NovaSearch1 estimates the density of the distribution of relevant documents
using KDE over the timestamps of the page views of a related Wikipedia page.

NovaSearch2 combines the NovaSearch0 and NovaSearch1 runs with equal weight.


1  https://github.com/shuyo/ldig


Table 1. TREC 2014 Microblog: Temporally-Anchored Ad Hoc Retrieval task results.

Run        MAP     R-prec  P30
Median     0.4209  0.4437  0.6315
QL         0.4268  0.4566  0.6345
RM3        0.4783  0.4872  0.6564
NovaRun0   0.4836  0.4904  0.6691
NovaRun1   0.4786  0.4851  0.6679
NovaRun2   0.4873  0.4950  0.6709

Fig. 1. Per-query differences in AP (all relevance levels) in relation to the RM3 run.
Panels: NovaRun0 and NovaRun1.


4  Summary 


Our results seem to indicate that the run NovaRun2, which uses two sources of
temporal evidence, gives the best results on all three measures. In addition,
the results for NovaRun1 show that page views from Wikipedia can be used
successfully to estimate relevant time periods for queries. When used in
reranking, the performance improved for some queries and deteriorated for others.